An Improved Approach to perform Crawling and avoid Duplicate Web Pages
Authors
Abstract
When a web search is performed, the results often include many duplicate web pages or websites; that is, similar pages may be retrieved from a number of different web servers. We propose a web crawling approach to detect and avoid duplicate or near-duplicate web pages. In this work we present a keyword-prioritization-based approach to identify such pages on the web. Identifying these pages will optimize the web search.
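The abstract does not specify how keyword prioritization is actually computed, so the following Python sketch only illustrates the general idea: rank each page's most frequent keywords into a compact signature, then flag two pages as near-duplicates when their signatures largely overlap. The stopword list, the `top_k` cutoff, and the 0.8 Jaccard threshold are illustrative assumptions, not values taken from the paper.

```python
import re
from collections import Counter

# Minimal illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def keyword_signature(text, top_k=10):
    """Extract the top-k most frequent non-stopword terms as a page signature."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return frozenset(w for w, _ in counts.most_common(top_k))

def is_near_duplicate(sig_a, sig_b, threshold=0.8):
    """Flag two pages as near-duplicates when the Jaccard similarity of
    their keyword signatures reaches the threshold."""
    if not sig_a or not sig_b:
        return False
    jaccard = len(sig_a & sig_b) / len(sig_a | sig_b)
    return jaccard >= threshold
```

A crawler could compute the signature once per fetched page and skip indexing any page whose signature matches one already stored.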
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them a key technique is focused crawling, which is able to crawl particular topical...
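The abstract above concerns ordering the URL frontier of a focused crawler. A common way to realize this is a priority queue keyed by a topical relevance score, so the most promising links are fetched first. The sketch below is a generic illustration, not the scheme of the cited paper; the toy `relevance` function (fraction of topic terms appearing in the anchor text) is an assumption.

```python
import heapq

class URLQueue:
    """Priority queue ordering frontier URLs by relevance, highest first."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: preserves insertion order for equal scores

    def push(self, url, relevance):
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-relevance, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

def relevance(anchor_text, topic_terms):
    """Toy relevance score: fraction of topic terms found in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)
```

A real focused crawler would replace the toy score with a classifier or cosine similarity over page content, but the queue mechanics stay the same.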
An Efficient Approach for Near-duplicate page detection in web crawling
The drastic development of the World Wide Web in recent times has given the concept of web crawling remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents has created additional overhea...
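The abstract above does not name its detection method, but a widely used fingerprinting technique for near-duplicate pages in crawling is SimHash: near-identical documents produce fingerprints that differ in only a few bit positions, so comparison reduces to a Hamming-distance check. The sketch below is a generic illustration under that assumption; the use of MD5 as the per-token hash is an arbitrary choice for self-containment.

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint over a bag of tokens."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            # Each token votes +1/-1 on every bit position of the fingerprint.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A crawler would treat two pages as near-duplicates when the Hamming distance of their fingerprints falls below a small threshold (often 3 for 64-bit fingerprints).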
A Survey of Duplicate And Near Duplicate Techniques
The World Wide Web consists of more than 50 billion pages online. The advent of the World Wide Web caused a dramatic increase in the usage of the Internet. The World Wide Web is a broadcast medium where a wide range of information can be obtained at a low cost. A great deal of the Web is replicated or near-replicated content. Documents may be served in different formats: HTML, PDF, and text for diff...
The improved Shark Search Approach for Crawling Large-scale Web Data
Web crawling is an important approach for collecting large-scale web data and keeping up with the rapidly expanding Internet. This paper puts forward an improved shark-search approach for crawling large-scale web data based on link clustering and tunneling technology. In this study we focus on the classification of web links, instead of downloaded web pages, to determine relevancy whic...
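The core of shark search is score inheritance along links: a child URL receives a decayed share of its parent's relevance, blended with the relevance of its own anchor text. The sketch below is a heavily simplified version of that idea, not the cited paper's improved algorithm; the `decay` and `mix` parameters are illustrative assumptions.

```python
def shark_child_score(parent_inherited, parent_relevance, anchor_relevance,
                      decay=0.8, mix=0.5):
    """Simplified shark-search scoring for a child URL.

    If the parent page itself is relevant, the child inherits a decayed
    share of that relevance; otherwise it inherits a decayed share of
    the parent's own inherited score (the 'tunneling' behavior). The
    result is blended with the relevance of the link's anchor text.
    """
    if parent_relevance > 0:
        inherited = decay * parent_relevance
    else:
        inherited = decay * parent_inherited
    return mix * inherited + (1 - mix) * anchor_relevance
```

Feeding these scores into a priority queue over the frontier yields the depth-decaying exploration behavior shark search is known for.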
Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling
Recent years have witnessed the drastic development of the World Wide Web (WWW). Information is accessible at one's fingertips anytime, anywhere through the massive web repository. The performance and reliability of web search engines thus face huge problems due to the presence of an enormous amount of web data. The voluminous amount of web documents has resulted in problems for search engines leading to ...
Journal title:
Volume, Issue
Pages: -
Publication date: 2012